Motivation

Formations and player roles are vague concepts. We simplify things into 4-2-3-1s and 4-3-3s, and into right backs and centre forwards, but within those simplifications there are many nuances in how different teams and different players function.

In this post, I try to quantify these nuances and find players who play similar roles in their respective teams.

This logic simplifies the role of a player to moving the ball from one part of the pitch to another. This by itself isn’t a sufficient measure of similar players, but it may still be a measure of similar roles. For instance, it doesn’t include physical attributes such as pace and stamina, the quality of shooting and passing, or the quality of awareness and positioning. The results from this logic could be used in addition to these other measures for better results.

Methodology

To compare players in isolation

  1. Isolate periods longer than 60 minutes during which no substitutions and no formation changes happened. Remove the rest of the data.

  2. For each player, in each match, extract four sets of locations -
    • all the locations where the player received a pass from, i.e. the location of the passer who passed the ball to this player. I will address this as (xstart1, ystart1).
    • all the locations where the player received a pass at, i.e. the location of the player when he received the ball. I will address this as (xend1, yend1).
    • all the locations where the player passed from, i.e. the location of the player when he passed the ball to another player. I will address this as (xstart2, ystart2).
    • all the locations the player passed to, i.e. the location of the recipient of the pass when he received the ball. I will address this as (xend2, yend2).

For example, if we are interested in player B, then for a play in which player A passed to player B and then player B passed to player C, we’d see two passes: pass 1, which went from A at location Start1 to B at location End1, and pass 2, which went from B at location Start2 to C at location End2, with player B having moved with the ball from location End1 to location Start2.

Each of those four points contributes respectively to the four different datasets described above.
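Steps 1 and 2 could be sketched roughly as below. This is only an illustration, not the code behind this post: the pass record layout (`minute`, `passer`, `receiver`, `x_start`, `y_start`, `x_end`, `y_end`), the function names, the 90 minute match length, and the coordinates are all assumptions made for the example.

```python
def stable_periods(change_minutes, match_end=90, min_length=60):
    """Stretches of play with no substitution or formation change (step 1)."""
    boundaries = [0] + sorted(change_minutes) + [match_end]
    return [(start, end)
            for start, end in zip(boundaries, boundaries[1:])
            if end - start > min_length]


def four_location_sets(passes, player, period):
    """Collect the four sets of pass locations for one player (step 2)."""
    start, end = period
    sets = {"received_from": [], "received_at": [],
            "passed_from": [], "passed_to": []}
    for p in passes:
        if not (start <= p["minute"] < end):
            continue  # outside the stable period, so the pass is dropped
        origin = (p["x_start"], p["y_start"])
        target = (p["x_end"], p["y_end"])
        if p["receiver"] == player:
            sets["received_from"].append(origin)  # (xstart1, ystart1)
            sets["received_at"].append(target)    # (xend1, yend1)
        if p["passer"] == player:
            sets["passed_from"].append(origin)    # (xstart2, ystart2)
            sets["passed_to"].append(target)      # (xend2, yend2)
    return sets


# The A -> B -> C play from the example, with made-up coordinates.
passes = [
    {"minute": 12, "passer": "A", "receiver": "B",
     "x_start": 10, "y_start": 20, "x_end": 30, "y_end": 40},
    {"minute": 12, "passer": "B", "receiver": "C",
     "x_start": 35, "y_start": 45, "x_end": 60, "y_end": 50},
]
periods = stable_periods(change_minutes=[65, 80])   # changes at 65' and 80'
sets_for_b = four_location_sets(passes, "B", periods[0])
```

With changes at 65 and 80 minutes, only the stretch from kick-off to the 65th minute is longer than 60 minutes, so that is the only period kept.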

  3. Compare each of those four distributions with the respective distributions of all other players in their matches. Use a distance measure, the earth mover’s distance (EMD), to decide whether the distributions of two players are similar.

  4. Combine the EMDs of the four sets of locations to come up with an overall distance. I add the four distances up as if they were Euclidean distances, i.e. overall distance = ( ( Distance1 ^ 2 ) + ( Distance2 ^ 2 ) + ( Distance3 ^ 2 ) + ( Distance4 ^ 2 ) ) ^ 0.5
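The combination of the four distances is simple enough to write down directly. A minimal sketch, where the function name and the input values are made up for illustration:

```python
import math


def overall_distance(d1, d2, d3, d4):
    """Combine the four EMDs as if they were components of a Euclidean distance."""
    return math.sqrt(d1 ** 2 + d2 ** 2 + d3 ** 2 + d4 ** 2)


# Made-up EMDs for the four sets of locations.
example = overall_distance(3.0, 4.0, 0.0, 0.0)
```

One consequence of squaring is that a single large distance dominates the total, so two players who differ a lot on just one of the four distributions still end up far apart.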

To compare teams

Using the player comparison metric, compare all eleven players of a team in one match with all eleven players of a team in another match. Find a one-to-one pairing between players such that the sum of distances within all eleven pairs is the smallest. This can be solved as an assignment problem, i.e. for each comparison, of all the ( 11 * 10 * 9 … * 1 ) = 11! possible pairings between the 11 players in each match, the one with the least sum of distances across the player pairs is chosen as the best pairing.

Here are all the 11 X 11 distances with the final pairs highlighted for some comparisons to give you an idea of how this works.

The sum or average of these eleven distances is a measure of how similarly the two teams passed, which can be assumed to be a proxy for how similar the two teams’ playing strategies were. The smaller the sum of distances, the closer the formations of the two teams. Instead of the average distance, the maximum amongst these eleven distances is a stricter measure which can be used for the same purpose, and this is what I prefer.
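The pairing and the team-level summaries can be sketched as below. Trying all 11! (about 40 million) pairings one by one is slow, so this sketch uses a small dynamic programme over bitmasks that gives the same exact answer; a Hungarian-algorithm routine such as `scipy.optimize.linear_sum_assignment` would be another option. The solver actually used for this post isn't stated here, and the function name and the 3 x 3 matrix are made up for illustration.

```python
from functools import lru_cache


def best_pairing(cost):
    """Exact minimum-sum one-to-one pairing for a square distance matrix.

    cost[i][j] is the distance between player i of one team and player j
    of the other. Returns (total, partners) where partners[i] is the index
    of the player paired with player i.
    """
    n = len(cost)

    @lru_cache(maxsize=None)
    def solve(row, used):
        # `used` is a bitmask of opposing players that are already paired.
        if row == n:
            return 0.0, ()
        best_total, best_cols = float("inf"), ()
        for col in range(n):
            if used & (1 << col):
                continue
            rest_total, rest_cols = solve(row + 1, used | (1 << col))
            total = cost[row][col] + rest_total
            if total < best_total:
                best_total, best_cols = total, (col,) + rest_cols
        return best_total, best_cols

    return solve(0, 0)


# Made-up 3 x 3 distances; a real comparison would use an 11 x 11 matrix.
cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
total, partners = best_pairing(cost)
paired = [cost[i][j] for i, j in enumerate(partners)]

team_sum = sum(paired)               # sum of the paired distances
team_mean = team_sum / len(paired)   # average paired distance
team_max = max(paired)               # the stricter measure preferred above
```

Note that the pairing itself is still chosen by the smallest sum; the maximum is only used afterwards to summarise how far apart the two teams are.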

EMD illustration

EMD is a measure of the amount of change needed to transform one distribution into another. For instance, if you have two distributions, one concentrated 100% at (0,0) and another with 50% at (0,1) and 50% at (1,1), then the EMD between them would be 1.2071068 = 0.5 * ( distance of moving a point from (0,1) to (0,0) = 1 ) + 0.5 * ( distance of moving a point from (1,1) to (0,0) = 1.4142135 )
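The arithmetic above can be checked with a few lines of code. This is a simplified sketch rather than a general EMD implementation: it assumes both distributions are given as equally sized lists of points with equal weight per point, in which case the EMD is the cheapest one-to-one matching averaged over the points. The brute-force search over permutations is only practical for tiny examples, and the function name is made up.

```python
import math
from itertools import permutations


def emd_equal_sets(points_a, points_b):
    """EMD between two equally sized point sets with equal weight per point."""
    if len(points_a) != len(points_b):
        raise ValueError("this sketch only handles equally sized sets")
    n = len(points_a)
    return min(
        sum(math.dist(a, b) for a, b in zip(points_a, order)) / n
        for order in permutations(points_b)
    )


# 100% of the mass at (0,0) is written as two half-weight copies of (0,0).
concentrated = [(0, 0), (0, 0)]
split = [(0, 1), (1, 1)]
distance = emd_equal_sets(concentrated, split)
```

For real pass data the two players will usually have different numbers of passes, which needs a proper transport solver rather than this matching shortcut.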

Here is an example from the dataset. Three distributions of points on the pitch, and the EMD between them are shown below. These three distributions are of locations from where Toby Alderweireld made passes against these teams.

Match1             Match2           DistancePassXY
A vs Huddersfield  H vs Leicester   5.251464
A vs Huddersfield  A vs West Brom   22.282919
H vs Leicester     A vs West Brom   24.657782

Note how the game against WBA has a very different set of points where passes were made from, and accordingly the distance between that game and the other two games is much higher than the distance between the other two games.

Additional notes

  • Passing data is probably more reflective of attacking similarity than defensive similarity. Like how I simplified attacking play to moving the ball from one area of the pitch to another, we could consider defense being about not letting the ball move into some area of the pitch. That would need off the ball positioning data which I don’t have access to.

  • For more role-specific distances, the distance measure can be extended to any other spatial distribution as well, e.g. locations where the player shoots from, where the player tackles, etc. While defensive actions get ignored, my hope is that offensive actions still get captured indirectly. For instance, if you expect to see a player shoot from certain positions, he may have fewer passes from those positions, and that should increase the distance between him and another player who tends to pass from those same positions.

  • I could have chosen to calculate the distance between the start and end coordinates of the passes as two four-dimensional datasets, (xstart1, ystart1, xend1, yend1) and (xstart2, ystart2, xend2, yend2). I instead chose to calculate the distance between the sets of starting points and ending points separately, as four two-dimensional datasets: (xstart1, ystart1), (xend1, yend1), (xstart2, ystart2), and (xend2, yend2). Cons - a player who passes from xy1 to xy2 and from xy3 to xy4, and another player who passes from xy1 to xy4 and from xy3 to xy2, would have 0 distance in the chosen methodology, which would be incorrect. Pros - it is easier to understand the reasons for the overall distance between two distributions. I haven’t verified this, but my hope is that the case in the cons is rare and the interpretability is worth that loss. I’ll test this at a later point in time.

  • Other aspects of a team’s strategy, such as the playing style of the opposition, what’s at stake, the situation in the game, red cards, etc. have been ignored for the sake of simplicity. It should be straightforward to match the playing style of the respective oppositions using the same method that we’re using to match the players and the teams.
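The trade-off between four two-dimensional datasets and two four-dimensional ones can be made concrete with a toy case. The sketch below uses made-up coordinates and a brute-force EMD that only works for equally sized, equally weighted point sets, so it illustrates the cons case rather than the real computation.

```python
import math
from itertools import permutations


def emd_equal_sets(points_a, points_b):
    """EMD between two equally sized point sets with equal weight per point."""
    n = len(points_a)
    return min(
        sum(math.dist(a, b) for a, b in zip(points_a, order)) / n
        for order in permutations(points_b)
    )


# Each pass is (x_start, y_start, x_end, y_end); coordinates are made up.
player_p = [(0, 0, 10, 0), (0, 10, 10, 10)]   # two straight passes
player_q = [(0, 0, 10, 10), (0, 10, 10, 0)]   # two diagonal passes

# Two-dimensional view: starts and ends compared separately.
starts_distance = emd_equal_sets([p[:2] for p in player_p],
                                 [q[:2] for q in player_q])
ends_distance = emd_equal_sets([p[2:] for p in player_p],
                               [q[2:] for q in player_q])

# Four-dimensional view: each pass compared as a whole.
joint_distance = emd_equal_sets(player_p, player_q)
```

Both players start from the same two points and find the same two targets, so the separate distances are 0, while the four-dimensional distance is 10 because the passes themselves connect different points.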

Data used

  • I only used data from matches for the 2017-18 season of the EPL, the EFL, Ligue 1, La Liga, Bundesliga, and Serie A. As a case study, I’ve taken a few players who were rumoured to be or were on the move around that time.

Cases

I walk through the process for these players, who cover most areas of the pitch, for one match each. Repeating this process for other matches of the same player in a similar role, and looking at the candidates who get shortlisted most often across these matches, would give a strong list of players who play very similarly to the respective player.

I’ve looked at Toby Alderweireld’s case in some detail to explain some aspects of how I would go about it. The other five players are left for you to go through.

Toby Alderweireld

His Own Performance

Finding players and teams who played similarly to Toby Alderweireld for Tottenham (H) vs Leicester

He looks to have played in an RCB position of a 4-2-3-1.

His similarity with other players who match his distributions, and the overall similarity of the team, look as below -

Points towards the bottom indicate that the respective player’s four spatial distributions, as described in the Methodology section, were similar to those of Toby Alderweireld in this match. Points towards the left, similarly, indicate that even the largest distance between any two paired players was small, which means that the team overall played very similarly to how Tottenham played in this match.

Shortlist

Only player distance

Regardless of team distance, players who had some of the smallest individual distances from Toby Alderweireld.

Chart 1 -

I compare all the performances of the shortlisted players with this performance of Toby Alderweireld. The blue dots are cases where the player got paired with Toby Alderweireld on comparing the two teams in their two respective matches. The red dots are cases where the player wasn’t paired with Toby Alderweireld but was paired with someone else.

As before, points close to the left bottom of the chart are the ones of interest -

  • Within this shortlist, a player like Cedric Yambere is probably more similar to Toby Alderweireld, and more used to playing in a similar system, than John O’Shea. JOS has a few performances which are very similar to TA, such as his game against Aston Villa away, which help him get shortlisted, but such performances are very few in number.

  • There might also be players who play in multiple positions or in different styles, who may have some points very close to the left bottom but not all of them, such as Grant Hanley. And maybe John O’Shea too. This is a little easier to assess in the second chart.

Looking at Toby Alderweireld’s comparisons with his own performance in other matches, playing West Brom away seems to have him playing very differently from how he tends to on most other occasions. This is the same example from the EMD illustration section and we already have an idea of why Toby Alderweireld’s performance in that match is so different.

Chart 2 -

The breakdown of the individual distances for each point can be seen in these charts. This is helpful to get a better understanding of the differences. For instance, the matches in which Federico Fernandes matched with Alderweireld, the blue lines, all seem to have a trend of being more different on where he passes the ball to, but less different on where he himself passes from. You can also see Grant Hanley clearly playing a different role in the matches he isn’t paired with Toby Alderweireld and can infer that where he receives the ball at and where he passes the ball from is the main reason that the role is different.

Only team distance

Regardless of player distance, teams in specific matches who had some of the smallest distances. The player that pairs with Toby Alderweireld in those matches is the shortlisted player.

( Toby Alderweireld had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Team and player distance combined

Finding the players at the least distance from Toby Alderweireld by giving equal weightage to player and team distance. This is my preferred way of looking at things because a player in a similar role playing for a team playing in a similar way is a better pairing than either of those two conditions separately.

( Toby Alderweireld had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Same team alternatives

For the sake of curiosity and validation, a look at other players from the same team who matched with Toby Alderweireld in some other matches.

The points are quite spread out on teamDistance despite the individual player distance hovering mostly around a consistent mark. The player himself didn’t manage to make the shortlist for the two shortlisting criteria which included team distance. This may indicate Spurs employing a wide variety of strategies in which the role of the RCB was mostly consistent.

Quite a few points are in an area similar to what we’ve observed in the earlier shortlists, player distance < ~20 and team distance < ~40. Even though Davinson Sanchez doesn’t show up in the closest 15 / 16 that we used for illustration purposes, he’s individually still playing in a very similar role to Toby Alderweireld in some matches. With numbers like these, he may still show up in shortlists for some of the other matches.

Not Shortlist

Some players who played in the same position as Toby Alderweireld in this match, RCB, in at least one match but didn’t play very similarly to him in those matches. These are players that should probably be avoided. I’ve included Toby Alderweireld’s performance as a reference.

Kyle Walker

His Own Performance

Finding players and teams who played similarly to Kyle Walker for Man City (H) vs Burnley

Shortlist

Only player distance

Only team distance

Team and player distance combined

Same team alternatives

Not Shortlist

Emre Can

His Own Performance

Finding players and teams who played similarly to Emre Can for Liverpool (H) vs Swansea

Shortlist

Only player distance

( Emre Can had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Only team distance

( Emre Can had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Team and player distance combined

( Emre Can had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Same team alternatives

Not Shortlist

Aaron Ramsey

His Own Performance

Finding players and teams who played similarly to Aaron Ramsey for Arsenal (H) vs Bournemouth

Shortlist

Only player distance

Only team distance

Team and player distance combined

Same team alternatives

Not Shortlist

Riyad Mahrez

His Own Performance

Finding players and teams who played similarly to Riyad Mahrez for Leicester (A) vs Stoke

Shortlist

Only player distance

( Riyad Mahrez had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Only team distance

( Riyad Mahrez had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Team and player distance combined

( Riyad Mahrez had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Same team alternatives

Not Shortlist

Aleksandar Mitrovic

His Own Performance

Finding players and teams who played similarly to Aleksandar Mitrovic for Fulham (H) vs Reading

Shortlist

Only player distance

( Aleksandar Mitrovic had to be force included in this list as he didn’t fall in the top 16 players by this criterion. )

Only team distance

Team and player distance combined

Same team alternatives

Not Shortlist

Quality of results

Strengths

  • The examples presented in the methodology section seem sensible. The same player matches with the same player or matches with someone who plays in a similar role.

  • In most cases, the shortlists drawn look like players who play in a similar manner to the player under consideration as well.

  • The strong presence of Wijnaldum and Henderson in Emre Can’s shortlist, and similarly the strong presence of various Arsenal players in Aaron Ramsey’s list indicate there is some underlying strategy to each team that this logic is able to pick out.

  • Except for Mahrez and Mitrovic, who don’t seem to have been rotated often, all the other players have a reasonable looking list of players from the same team who played in a similar role in some other matches.

  • It was hard to find good matches for Mahrez, primarily due to the very unusual playing strategy his team adopted. Both central midfielders, Wilfried Ndidi and Vicente Iborra, look like they were playing in an RCM sort of position with no CM or LCM. As a team, none of their other performances were very similar to their performance in this match.

Speculative Strengths

  • Maya Yoshida and Toby Alderweireld are both Southampton alumni, along with Alderweireld’s current manager, Mauricio Pochettino. MP hadn’t managed TA but had managed MY while at Southampton.

  • Marchisio appearing in Emre Can’s list. EC’s eventual move to Juventus was to replace him?

  • Alexander Oxlade Chamberlain in Ramsey’s list. AR was blocking the spot in centre midfield that AOC wanted which is why AOC moved away from Arsenal eventually?

Weaknesses

  • This model needs a player to be involved in a sufficient number of passes for the logic to have enough data to work with. This is the reason I excluded goalkeepers from the cases I looked at. For teams where the forwards are left isolated and are involved in very few passes, or the keepers are not involved much, this may cause an artificially high distance between the teams. A possible fix might be to weigh the distance by the number of passes?

  • Without giving benefit of doubt that the logic was picking out something that is contrary to expectation but still correct, there is definitely more resilience to noise needed given occasional oddities such as Kevin De Bruyne pairing with Kyle Walker, Shinji Okazaki pairing with Riyad Mahrez, etc. Given the low occurrence levels, I expect this to not be a problem when aggregating this across multiple matches.

Scope for usage

While the underlying usage remains to identify players in similar roles, this concept of pairing players allows the formation of a baseline against which players could be compared. For instance, you’d expect different pass completion percents from players playing in different parts of the pitch but a comparison of a player only with other players that they have been paired with is a more reasonable and useful comparison.

You could also simulate your own players. If you wanted an LCB who plays similarly to how Toby Alderweireld plays, you could just make mirror images of his passes across the left and the right half of the pitch and find matches for this new set of passes. You could go a step further and create your own player by creating your own data of the sort of passes you expect a player to be making and then look for such a player in the database.
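Mirroring a player’s passes could look roughly like the sketch below. The coordinate convention is an assumption made for the example: x runs along the length of the pitch, y runs across its width from 0 to `pitch_width`, and the data source used for this post may well use a different scale. The function name and the pass coordinates are made up.

```python
def mirror_passes(passes, pitch_width=100.0):
    """Reflect passes across the pitch's long axis (right side <-> left side)."""
    return [
        (x_start, pitch_width - y_start, x_end, pitch_width - y_end)
        for x_start, y_start, x_end, y_end in passes
    ]


# One made-up pass as (x_start, y_start, x_end, y_end).
right_sided = [(20.0, 30.0, 45.0, 10.0)]
left_sided = mirror_passes(right_sided)
```

The mirrored passes can then go through the same four-distribution comparison as any real player’s passes.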

Get in touch

Do you have suggestions, comments, new ideas to build on top of this, etc.? I’d love to hear. Find me on Twitter - @thecomeonman.